23 research outputs found

    Finding Similar Documents Using Different Clustering Techniques

    Text clustering is an important application of data mining. It is concerned with grouping similar text documents together. In this paper, several models are built to cluster capstone project documents using three clustering techniques: k-means, k-means fast, and k-medoids. Our dataset is obtained from the library of the College of Computer and Information Sciences, King Saud University, Riyadh. Three similarity measures are tested: cosine similarity, Jaccard similarity, and correlation coefficient. The quality of the obtained models is evaluated and compared. The results indicate that the best performance is achieved using k-means and k-medoids combined with cosine similarity. We observe variation in the quality of clustering based on the evaluation measure used. In addition, as the value of k increases, the quality of the resulting clusters improves. Finally, we reveal the categories of graduation projects offered in the Information Technology department for female students
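
The three similarity measures named in this abstract can be sketched on small term-count vectors as follows. This is only an illustrative sketch: the function names and toy vectors are assumptions, and the Jaccard variant shown is the set-based one (terms present or absent), which may differ from the variant the authors used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Set-based Jaccard: shared terms divided by all terms present."""
    terms_a = {i for i, x in enumerate(a) if x > 0}
    terms_b = {i for i, y in enumerate(b) if y > 0}
    return len(terms_a & terms_b) / len(terms_a | terms_b)


def correlation_coefficient(a, b):
    """Pearson correlation between two term-count vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy term-count vectors (hypothetical, not taken from the paper's dataset).
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]  # same term proportions as doc_a
doc_c = [0, 0, 3, 1]  # no terms in common with doc_a

for name, measure in [("cosine", cosine_similarity),
                      ("jaccard", jaccard_similarity),
                      ("correlation", correlation_coefficient)]:
    print(name, round(measure(doc_a, doc_b), 3), round(measure(doc_a, doc_c), 3))
```

Documents with the same term proportions score close to 1 on all three measures, while documents with no shared terms score 0 on cosine and Jaccard and can score negative on correlation.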

    Machine learning approaches in COVID-19 diagnosis, mortality, and severity risk prediction: A review

    The existence of widespread COVID-19 infections has prompted worldwide efforts to control and manage the virus, and hopefully curb it completely. One important line of research is the use of machine learning (ML) to understand and fight COVID-19. This is currently an active research field. Although there are already many surveys in the literature, there is a need to keep up with the rapidly growing number of publications on COVID-19-related applications of ML. This paper presents a review of recent reports on ML algorithms used in relation to COVID-19. We focus on the potential of ML for two main applications: diagnosis of COVID-19 and prediction of mortality risk and severity, using readily available clinical and laboratory data. Aspects related to algorithm types, training data sets, and feature selection are discussed. As we cover work published between January 2020 and January 2021, a few key points have come to light. The bulk of the machine learning algorithms used in these two applications are supervised learning algorithms. The established models are yet to be used in real-world implementations, and much of the associated research is experimental. The diagnostic and prognostic features discovered by ML models are consistent with results presented in the medical literature. A limitation of the existing applications is the use of imbalanced data sets that are prone to selection bias

    Incremental Ant-Miner Classifier for Online Big Data Analytics

    Internet of Things (IoT) environments produce large amounts of data that are challenging to analyze. The most challenging aspect is reducing the resources and time required to retrain a machine learning model as new data records arrive. Therefore, for big data analytics in IoT environments where datasets are highly dynamic and evolve over time, it is highly advisable to adopt an online (also called incremental) machine learning model that can analyze incoming data instantaneously, rather than an offline (also called static) model that must be retrained on the entire dataset as new records arrive. The main contribution of this paper is to introduce the Incremental Ant-Miner (IAM), a machine learning algorithm for online prediction based on one of the most well-established machine learning algorithms, Ant-Miner. The IAM classifier tackles the challenge of reducing the time and space overheads associated with classic offline classifiers when they are used for online prediction. IAM can be exploited in managing dynamic environments to ensure timely and space-efficient prediction, achieving high accuracy, precision, recall, and F-measure scores. To show its effectiveness, the proposed IAM was run on six datasets from different domains, namely horse colic, credit cards, flags, ionosphere, and two breast cancer datasets. The performance of the proposed model was compared to ten state-of-the-art classifiers: naive Bayes, logistic regression, multilayer perceptron, support vector machine, K*, adaptive boosting (AdaBoost), bagging, Projective Adaptive Resonance Theory (PART), decision tree (C4.5), and random forest. The experimental results illustrate the superiority of IAM, as it outperformed all the benchmarks in nearly all performance measures. Additionally, when new data records arrive, IAM only needs to be rerun on the new data increment rather than on the entire dataset, which saves time and resources. These results demonstrate the strong potential and efficiency of the IAM classifier for big data analytics in various areas
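
The contrast between online (incremental) and offline learning described in this abstract can be illustrated with a much simpler stand-in model. The sketch below is not Ant-Miner or IAM: it is a count-based naive Bayes classifier, chosen only because its state can be updated from a new data increment alone. The class name, method names, and toy records are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class IncrementalNaiveBayes:
    """Toy categorical naive Bayes whose counts are updated per increment."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)  # (label, index, value) -> count
        self.values_seen = defaultdict(set)     # index -> distinct values

    def partial_fit(self, rows, labels):
        # Only the new rows are visited; earlier data is never re-read.
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for i, value in enumerate(row):
                self.feature_counts[(label, i, value)] += 1
                self.values_seen[i].add(value)

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for i, value in enumerate(row):
                # Laplace smoothing so unseen values do not zero the score.
                num = self.feature_counts.get((label, i, value), 0) + 1
                den = count + len(self.values_seen[i])
                score += math.log(num / den)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical records arriving in two increments.
batch1 = ([("sunny", "hot"), ("rainy", "cool")], ["no", "yes"])
batch2 = ([("sunny", "mild"), ("rainy", "mild")], ["no", "yes"])

online = IncrementalNaiveBayes()
online.partial_fit(*batch1)
online.partial_fit(*batch2)  # update with the new increment only

offline = IncrementalNaiveBayes()
offline.partial_fit(batch1[0] + batch2[0], batch1[1] + batch2[1])  # full retrain

print(online.predict(("sunny", "cool")), offline.predict(("sunny", "cool")))
```

For this stand-in model the incrementally updated state equals the fully retrained state, which is the property that makes per-increment updates cheaper than retraining on the whole dataset.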

    Towards Accurate Children’s Arabic Handwriting Recognition via Deep Learning

    Automatic handwriting recognition has received considerable attention over the past three decades. Handwriting recognition systems are useful for a wide range of applications. Much research has been conducted to address the problem in Latin languages. However, less research has focused on the Arabic language, especially on recognizing children’s Arabic handwriting. This task is essential as the demand for educational applications to practice writing and spelling Arabic letters is increasing. Thus, the development of Arabic handwriting recognition systems and applications for children is important. In this paper, we propose two deep learning-based models for the recognition of children’s Arabic handwriting. The proposed models, a convolutional neural network (CNN) and a pre-trained CNN (VGG-16), were trained using Hijja, a recent dataset of Arabic children’s handwriting collected in Saudi Arabia. We also train and test our proposed models using the Arabic Handwritten Character Dataset (AHCD). We compare the performance of the proposed models with similar models from the literature. The results indicate that our proposed CNN outperforms the pre-trained CNN (VGG-16) and the other compared models from the literature. Moreover, we developed Mutqin, a prototype to help children practice Arabic handwriting. The prototype was evaluated by target users, and the results are reported

    Early Detection of Red Palm Weevil, Rhynchophorus ferrugineus (Olivier), Infestation Using Data Mining

    In the past 30 years, the red palm weevil (RPW), Rhynchophorus ferrugineus (Olivier), a pest that is highly destructive to all types of palms, has rapidly spread worldwide. However, detecting infestation with the RPW is highly challenging because symptoms are not visible until the death of the palm tree is inevitable. In addition, the use of automated RPW identification tools to predict infestation is complicated by a lack of RPW datasets. In this study, we assessed the capability of 10 state-of-the-art data mining classification algorithms, Naive Bayes (NB), KSTAR, AdaBoost, bagging, PART, J48 Decision tree, multilayer perceptron (MLP), support vector machine (SVM), random forest, and logistic regression, to use plant-size and temperature measurements collected from individual trees to predict RPW infestation in its early stages, before significant damage is caused to the tree. The performance of the classification algorithms was evaluated in terms of accuracy, precision, recall, and F-measure using a real RPW dataset. The experimental results showed that infestations with RPW can be predicted using data mining with an accuracy of up to 93%, precision above 87%, recall of 100%, and F-measure greater than 93%. Additionally, we found that temperature and circumference are the most important features for predicting RPW infestation. However, we strongly call for collecting and aggregating more RPW datasets to run more experiments to validate these results and provide more conclusive findings
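
The four evaluation measures named in this abstract follow from the counts in a binary confusion matrix. The sketch below uses the standard definitions. The function name and the example counts are invented for illustration and are not taken from the paper's RPW dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F-measure) from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted infestations that are real
    recall = tp / (tp + fn)     # share of real infestations that are detected
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure


# Hypothetical counts: every infested tree is flagged (no false negatives),
# at the cost of two healthy trees being flagged as well.
accuracy, precision, recall, f_measure = classification_metrics(
    tp=8, fp=2, fn=0, tn=10
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F-measure={f_measure:.3f}")
```

A recall of 100% means no infested tree was missed, which is compatible with a lower precision when some healthy trees are flagged as infested.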

    Molecular Characterization of Sarcocystis Species Isolated from Sheep and Goats in Riyadh, Saudi Arabia

    Sarcocystosis is induced by species of Sarcocystis, which is an intracellular protozoan parasite in the phylum Apicomplexa. The diversity and importance of Sarcocystis species in sheep and goats in Saudi Arabia are poorly understood. In this study, the tongue, esophagus, heart, diaphragm, and skeletal muscles were collected from 230 sheep and 84 goats, and the tissues were examined for the presence of Sarcocystis species by macroscopic examination and light microscopy. Microscopic Sarcocystis species cysts were found in both sheep and goats. Transmission electron microscopy (TEM) revealed S. tenella in sheep and S. capracanis in goats. Sarcocystis species were confirmed for the first time in Saudi Arabian sheep and goats by molecular testing. S. capracanis was most closely related to S. tenella, with the COX1 sequences sharing 91.7% identity. A phylogenetic analysis produced similar results and indicated that the Sarcocystis isolates were within a group of Sarcocystis species in which dogs were the final host. Finally, the Sarcocystis species cysts from sheep and goats could be grouped together, indicating that they were strongly related

    Molecular Identification of Trypanosoma evansi Isolated from Arabian Camels (Camelus dromedarius) in Riyadh and Al-Qassim, Saudi Arabia

    We analyzed the blood from 400 one-humped camels, Camelus dromedarius (C. dromedarius), in Riyadh and Al-Qassim, Saudi Arabia to determine if they were infected with the parasite Trypanosoma spp. Polymerase chain reaction (PCR) targeting the internal transcribed spacer 1 (ITS1) gene was used to detect the prevalence of Trypanosoma spp. in the camels. Trypanosoma evansi (T. evansi) was detected in 79 of 200 camels in Riyadh, an infection rate of 39.5%, and in 92 of 200 camels in Al-Qassim, an infection rate of 46%. Sequence and phylogenetic analyses revealed that the isolated T. evansi was closely related to the T. evansi that was detected in C. dromedarius in Egypt and the T. evansi strain B15.1 18S ribosomal RNA gene identified from buffalo in Thailand. A BLAST search revealed that the sequences are also similar to those of T. evansi from beef cattle in Thailand and to T. brucei B8/18 18S ribosomal RNA from pigs in Nigeria
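
The infection rates quoted in this abstract can be checked directly from the reported counts. In the sketch below the function name is an assumption, and the combined rate across both regions is derived here from the reported counts rather than stated in the abstract.

```python
def infection_rate(positive, sampled):
    """Percentage of sampled animals that tested positive."""
    return 100.0 * positive / sampled


riyadh = infection_rate(79, 200)      # counts reported for Riyadh
al_qassim = infection_rate(92, 200)   # counts reported for Al-Qassim
combined = infection_rate(79 + 92, 200 + 200)  # derived, not in the abstract

print(f"Riyadh: {riyadh}%  Al-Qassim: {al_qassim}%  Combined: {combined}%")
```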

    Evolutionary Algorithm with Deep Auto Encoder Network Based Website Phishing Detection and Classification

    Website phishing is a cyberattack that targets online users in order to steal sensitive data such as login credentials and banking details. Phishing websites appear very similar to their legitimate counterparts in order to attract large numbers of Internet users. The attacker fools the user by presenting the masked webpage as legitimate or reliable in order to obtain the user's important information. Presently, anti-phishing approaches require experts to extract phishing site features and rely on third-party services for phishing website detection. These techniques have drawbacks, as relying on experts to extract phishing features is time consuming. Many solutions to phishing website attacks have been presented, such as blacklist or whitelist, heuristic, and machine learning (ML) based approaches, but they struggle to achieve effective detection performance because phishing techniques improve continually. Therefore, this study presents an optimal deep autoencoder network based website phishing detection and classification (ODAE-WPDC) model. The proposed ODAE-WPDC model applies input data pre-processing at the initial stage to remove missing values from the dataset. Then, feature extraction and artificial algae algorithm (AAA) based feature selection (FS) are applied. The DAE model then performs classification on the selected features, and the parameters of the DAE are tuned using the invasive weed optimization (IWO) algorithm to enhance performance. The performance of the ODAE-WPDC technique was validated using the Phishing URL dataset from the Kaggle repository. The experimental findings confirm the better performance of the ODAE-WPDC model, with a maximum accuracy of 99.28%